METHOD AND CIRCUIT FOR NORMALIZATION OF FLOATING POINT
SIGNIFICANDS IN A SIMD ARRAY MPP

FIELD OF THE INVENTION

[0001] The present invention relates to the field of massively parallel processing
systems, and more particularly to a method and apparatus for efficiently normalizing and
aligning the significand portion of a floating point number in a single instruction multiple
data massively parallel processing system.

BACKGROUND OF THE INVENTION



[0002] This application is related to application serial number 09/ ,
filed on , entitled "Method and Circuit for Alignment of Floating Point
Significands in a SIMD Array," the disclosure of which is incorporated by
reference.



[0003] The fundamental architecture used by all personal computers (PCs) and
workstations is generally known as the von Neumann architecture, illustrated in block
diagram form in Fig. 1. In the von Neumann architecture, a main central processing
unit (CPU) 10 is coupled via a system bus 11 to a memory 12. The memory 12,
referred to herein as "main memory", also contains the data on which the CPU 10
operates. In modern computer systems, a hierarchy of cache memories is usually built
into the system to reduce the amount of traffic between the CPU 10 and the main
memory 12.



[0004] The von Neumann approach is adequate for low to medium performance
applications, particularly when some system functions can be accelerated by special
purpose hardware (e.g., 3D graphics accelerator, digital signal processor (DSP), video
encoder or decoder, audio or music processor, etc.). However, the approach of adding
accelerator hardware is limited by the bandwidth of the link from the CPU/memory
part of the system to the accelerator. The approach may be further limited if the
bandwidth is shared by more than one accelerator. Thus, the processing demands of
large data sets are not served well by the von Neumann architecture. Similarly, as the
processing becomes more complex and the data larger, the processing demands may not
be met even with the conventional accelerator approach.

[0005] Referring now to Fig. 2, an alternative to the von Neumann architecture is
the single instruction multiple data (SIMD) massively parallel processing (MPP) system.
An MPP system differs from a von Neumann system by using a large number of
processors, called processing elements (PE) 200, coupled to a communications network
15. The communications network 15 permits each PE 200 to exchange data with other
PEs 200. Additionally, the PEs 200 may read or write to main memory 12 via an
array-to-memory bus 13, or receive commands or instructions from CPU 10 via bus 11.
Although the CPU 10 may perform some processing, in a SIMD MPP system, the array
of PEs 14, comprising the PEs 200 and its communications network 15, performs most
of the computations. The CPU 10 functions in a supporting role.

[0006] In a SIMD MPP, each PE operates on the same instruction, at the same
time, but on different pieces of data. Since the PEs in a SIMD array operate in lockstep,
data dependent conditional operations cannot be performed by branching, as would be
done in a conventional processor. Instead, each PE can decide whether to store the
result of an operation either in an internal register or in a memory dependent upon a
condition generated within the PE from data local to the PE. This technique is known
as "activity control" and is a very powerful method for performing data dependent
decisions in a parallel computer which operates on a single stream of instructions.
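
Activity control can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions: the function name `simd_step` and the list-of-values model of the PE array are hypothetical and do not come from the disclosure. Every modeled PE executes the same operation, and each one keeps the result only if a condition computed from its own local data holds.

```python
def simd_step(pe_data, op, condition):
    """Model one lockstep instruction with activity control.

    pe_data   -- one local value per PE (hypothetical software model)
    op        -- the single instruction executed by every PE
    condition -- predicate evaluated by each PE on its own local data
    """
    results = [op(x) for x in pe_data]        # every PE executes the same instruction
    active = [condition(x) for x in pe_data]  # per-PE decision from local data only
    # A PE stores the result only when its condition holds; otherwise it keeps its old value.
    return [r if a else x for x, r, a in zip(pe_data, results, active)]


# Halve the value held by each PE, but only in PEs whose value is even.
values = [4, 7, 10, 3]
halved = simd_step(values, lambda x: x // 2, lambda x: x % 2 == 0)
```

No PE branches in this model; the instruction stream is identical for all of them and only the store is conditional.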

[0007] Most SIMD MPPs utilize relatively simple processors for PEs 200. For
example, short integer PEs 200, such as 8-bit integer processors, may be used. SIMD
MPPs utilize these simple processors in order to increase the number of PEs 200 which
can be integrated upon a single silicon die. High performance is achieved by the use of
a large number of simple PEs 200, each operating at a high clock speed.

[0008] The use of short integer PEs 200 means that floating point operations may
require several clock cycles to complete. In many computer systems, floating point
numbers are often stored in a manner consistent with the IEEE-754 standard. In
particular, the IEEE-754 standard stores a single precision floating point number as three
binary fields taking the format of:

$(-1)^{s} \times 2^{(e-127)} \times (1.f)$ (1)

s is a single bit representing the sign of the floating point number.

e is an 8-bit unsigned integer representing a biased exponent. e is
said to represent a biased exponent because the actual exponent being
represented is equal to e - 127. Although an 8-bit unsigned integer may
range from 0-255, thereby permitting exponents in the range from -127
(i.e., -127 = 0 - 127) to +128 (i.e., 128 = 255 - 127), the IEEE-754
standard limits the range of usable exponents to exclude -127 and +128.

1.f is a 24-bit significand field in a "normalized" format, i.e., a bit
field in which the most significant bit (MSB) is the first digit left of the
binary point and in which the most significant bit is set to one. Since the
most significant bit of a normalized number is understood to be 1, there is
no need to store the most significant bit.
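
The three fields of format (1) can be extracted in software as a minimal sketch. The helper names `demerge` and `value` are hypothetical; they use Python's standard `struct` module and are not the circuit described in this disclosure. The reconstruction in `value` applies only to normalized numbers, i.e., biased exponents from 1 to 254.

```python
import struct


def demerge(x):
    """Split a single precision value into (s, e, f) per format (1)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    s = bits >> 31            # 1-bit sign
    e = (bits >> 23) & 0xFF   # 8-bit biased exponent
    f = bits & 0x7FFFFF       # 23 stored fraction bits (the leading 1 is implied)
    return s, e, f


def value(s, e, f):
    """Rebuild the number from its fields; valid for normalized numbers only."""
    return (-1) ** s * 2.0 ** (e - 127) * (1 + f / 2 ** 23)


s, e, f = demerge(-6.5)   # -6.5 = -1.625 x 2^2
```

Here the biased exponent is 129, so the actual exponent is 129 - 127 = 2, and the stored fraction encodes 0.625.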



[0009] Data which have biased exponents of 0 and 255 are used to represent
special conditions and the number zero. The IEEE-754 standard represents the
number zero using a biased exponent of 0 (i.e., for the single precision format, the
exponent equals -127) and a significand field of $000000000000000000000000_2$. (In
the special cases of zero and non-normalized numbers, indicated by the exponent being
0, the most significant bit of the significand is not taken to be a 1.)

[0010] Under the IEEE-754 standard, single extended, double, and double
extended precision numbers are stored in a similar format, albeit using different sized
exponents and significands. For example, double precision numbers use an 11-bit biased
exponent field with representable exponents ranging from -1022 to 1023 and a
significand having 53 bits.

[0011] In order to perform arithmetic operations on floating point numbers stored
in the IEEE-754 format, the floating point numbers first need to be separated, or
"demerged", to extract the sign bit, the exponent, and the significand. Once these
fields have been extracted, they can be operated upon in order to perform the arithmetic
operation. For example, multiplying two floating point numbers includes multiplying
the significands and adding the exponents. Once the arithmetic operation has been
performed, the significand field of the result may not be in a normalized format. For
example, multiplication of two operands with normalized significands results in an
answer ranging from $0_2$ to $100_2$. The process of returning a significand field back to a
normalized format is known as normalization.
[0012] In conventional computer systems, normalization is normally performed
using standard shifting logic, such as barrel shifters. Shifting logic is used in
conventional computer systems because it has adequate speed and does not
consume a significant amount of silicon real estate in comparison to the other circuitry
in a complex CPU 10. However, in a SIMD MPP using simple PEs 200, standard
shifting logic such as barrel shifters would significantly increase the size of the PEs 200
and also be too slow. Accordingly, there is a desire and need for a way to efficiently
perform normalization of floating point significands in a SIMD MPP environment.

SUMMARY OF THE INVENTION 

[0013] The present invention is directed at a processing element of a SIMD MPP
which can efficiently perform the normalization processes commonly used when
performing arithmetic operations on floating point numbers. The PEs of the SIMD
MPP include two groups of registers. One of the groups is known as the M block and
includes a plurality of registers and logic which permits limited right shifting of the
contents of the registers. The other group of registers is known as the Q block and
includes a plurality of registers and logic which permits limited left shifting (e.g., 1-, 2-,
4-, and 8-bit left shifts are supported) of the contents of the registers. A method is
used with the limited left shifting ability of the Q block registers to normalize the result
of an arithmetic calculation.
BRIEF DESCRIPTION OF THE DRAWINGS

[0014] The foregoing and other advantages and features of the invention will
become more apparent from the detailed description of the preferred embodiments of
the invention given below with reference to the accompanying drawings in which:

[0015] FIG. 1 is a block diagram of a prior art von Neumann architecture
computer system;

[0016] FIG. 2 is a block diagram of a SIMD MPP computer system;

[0017] FIG. 3 is a block diagram of one of the PEs in the SIMD MPP computer
system in accordance with the principles of the present invention;

[0018] FIGS. 4A and 4B are a flow chart which illustrates how the PE of the
present invention aligns significand data; and

[0019] FIG. 5 is a flowchart which illustrates how the PE of the present invention
normalizes significand data.

DETAILED DESCRIPTION OF THE INVENTION

[0020] Now referring to the drawings, where like reference numerals designate like
elements, there is shown in Fig. 3 a block diagram of a PE 200 in accordance with the
principles of the present invention. The PE 200 is divided into several functional
blocks, including an ALU 301, which is coupled to a Node Communications Interface
305 and a DRAM Interface 303. The Node Communications Interface 305 is used by
the PE 200 to send messages to and receive messages from the four other PEs 200
adjacent to the present PE 200, over signal lines 306a, 306b, 306c, and 306d. The DRAM
Interface 303 is used by the PE 200 to read and write to a main memory 12. The ALU 301 is
also coupled to a series of registers, including a register file 302 used to store data, a
series of flag registers, and a shift control register ("SCR") 360. In the exemplary
embodiment, the SCR 360 is an 8-bit register with the most significant bit designated
bit 7 and the least significant bit designated bit 0. The function of the flag registers 307
and the SCR 360 will be explained later. The PE 200 also includes two register blocks,
namely the M Block 350a and the Q Block 350b.

[0021] The M block 350a includes a bus called the M Bus 307a which is coupled
to the Node Communications Interface 305. The M bus 307a is also coupled, via logic
circuit 308a, to a plurality of registers. These registers include the M3 310, M2 311,
M1 312, M0 313, and MS 314 registers. In some embodiments an optional G
register 320 may also be present. The G register 320 may be used, for example, to store
extension bits for use in higher precision calculations. In one exemplary embodiment,
registers M3 310, M2 311, M1 312, and M0 313 are 8-bit registers while register MS
314 is a single bit register. Logic circuit 308b couples registers M3 310, M2 311, M1
312, M0 313, MS 314, and G 320 to Q Bus 307b, ALU 301 and DRAM Interface
304. The logic circuits 308a and 308b represent conventional logic circuits, such as a
network of multiplexers, which permit the registers M3 310, M2 311, M1 312, M0
313, MS 314, and G 320 to receive and transmit data in a manner which will be
described in additional detail.

[0022] Additionally, logic circuits 308a, 308b are also capable of demerging an
IEEE-754 formatted number into its sign, biased exponent, and significand fields. In
particular, the sign is stored in register MS 314, the biased exponent is stored in M3
310, and the significand is stored in registers M2 311 (most significant byte), M1 312,
and M0 313 (least significant byte). The logic circuits 308a, 308b may also be capable
of setting registers M2 311, M1 312, and M0 313 to zero. Finally, logic circuits 308a,
308b also permit data stored in registers M2 311 and M1 312 to be right shifted in
increments of 1, 2, 4, and 8 bits. The M registers (i.e., MS 314, M0 313, M1 312, M2
311, and M3 310) and the Q registers (i.e., QS 334, Q0 333, Q1 332, Q2 331, and
Q3 330) are coupled via signal line 307c. This permits the contents of the M registers
to be transferred in one clock cycle to corresponding Q registers in the Q block.

[0023] The Q block 350b is similar to the M block 350a. The Q block has a bus
known as the Q bus 307b. The Q bus 307b is not coupled to the Node
Communications Interface 305. Instead, the Q bus 307b is coupled via signal line 307c
to the M Bus 307a of the M block 350a. The Q block 350b includes a series of Q
registers, namely QS 334, Q0 333, Q1 332, Q2 331, and Q3 330. In the exemplary
embodiment register QS is a single bit register while registers Q0 333, Q1 332, Q2
331, and Q3 330 are 8-bit registers. The Q block 350b has logic circuits 309a, 309b
which function in a manner similar to logic circuits 308a, 308b of the M block 350a.
One significant difference between the two sets of logic circuits, 308a/308b and
309a/309b, however, is that while logic circuits 308a, 308b permit data stored in
registers M2 and M1 to be right shifted in 1, 2, 4, and 8 bit increments, logic circuits
309a, 309b permit data in registers Q2 331 and Q1 332 to be left shifted in the same
increments.

[0024] The PE 200 also includes a flag register 307 which contains a plurality of
flags. These flags default to zero unless a specific condition sets them to
one. In the exemplary embodiment there are four flags named Q2Z8, Q2Z4, Q2Z2,
and Q2Z1, which function as described below. Flag Q2Z8 is one if all eight bits of
register Q2 331 are zero. Flag Q2Z4 is one if the four most significant bits of register
Q2 331 are zero. Flag Q2Z2 is one if the two most significant bits of register Q2 331
are both zero. Finally, flag Q2Z1 is one if the most significant bit of register Q2 331 is
zero.
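
The four flag definitions can be expressed compactly in software. The following is a sketch only: the function `q2_flags` is a hypothetical software model of the zero-detect conditions, not the flag circuitry itself.

```python
def q2_flags(q2):
    """Compute the four zero-detect flags for an 8-bit Q2 register value."""
    q2 &= 0xFF  # model Q2 as an 8-bit register
    return {
        "Q2Z8": int(q2 == 0),        # all eight bits are zero
        "Q2Z4": int(q2 >> 4 == 0),   # four most significant bits are zero
        "Q2Z2": int(q2 >> 6 == 0),   # two most significant bits are zero
        "Q2Z1": int(q2 >> 7 == 0),   # most significant bit is zero
    }


flags = q2_flags(0b00010101)
```

For the value 0001 0101, only the two narrowest tests succeed, so Q2Z2 and Q2Z1 are one while Q2Z8 and Q2Z4 are zero.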

[0025] The PE 200 performs floating point arithmetic operations by first
demerging the two IEEE-754 formatted operands. This is done by loading the first
operand into the M block 350a. The operand may be loaded from the Node
Communications Interface 305 if the operand is sent from an adjacent PE 200.
Alternatively, the operand may be loaded from the DRAM Interface 303 if the operand
had been loaded into the main memory 12. As mentioned previously, the logic circuits
308a, 308b in M block 350a demerge an IEEE-754 formatted operand into its sign,
biased exponent, and significand fields by storing the sign field in register MS 314, the
biased exponent in register M3 310, and the significand in registers M2 311 and M1
312. Once the first operand has been demerged, it is transferred via signal line 307c to
the Q block 350b. The second operand is then loaded to the M block 350a and
demerged. At this point, the two demerged successive operands are in the M block
350a and the Q block 350b.



[0026] The ALU 301, which is coupled to the M block 350a via logic circuit 308b
and the Q block 350b via logic circuit 309b, is used to perform the arithmetic operation
in an ordinary manner. For example, the significands may be added, subtracted, or
multiplied. For addition and subtraction the exponents of the operands are equal and
do not require adjustment. For multiplication, the exponents are summed. The results
of the arithmetic operation are stored in the Q block 350b. As usual, the most
significant byte of the result is stored in register Q2, and lesser significant bytes of the
result are progressively stored in registers Q1 and Q0. If there are additional bits of the
result which need storing, the lesser significant bytes of the result may be stored in the
G register 320 (if present) and the M0 register 313 of the M Block 350a, and additional
lesser significant bytes of the result may be stored in the register file.




[0027] After performing the arithmetic operation, the significand may not be in
normalized form. In order to comply with the IEEE-754 standard, the significand
stored in the plurality of Q registers Q2 331, Q1 332, Q0 333 may need normalization.
In general, an arithmetic operation may result in a significand having a
number of zeros (up to the level of precision, i.e., up to 24 for IEEE-754 single
precision arithmetic) at the most significant portion of the significand. The
normalization process shifts the significand so that the most significant bit (i.e., bit 7 of
register Q2 331) is a one.

[0028] The normalization of the significand is performed according to the 8 steps
described below and illustrated in Fig. 5, steps 500-515:

[0029] (Step 1) Set a temporary variable, such as one of the registers in the register
file 302, to zero (Fig. 5, 501).

[0030] (Step 2) If flag Q2Z8 is equal to one (Fig. 5, 502), left shift the result by
8 bits and add 8 to the temporary variable (Fig. 5, 503).

[0031] (Step 3) If flag Q2Z8 is equal to one (Fig. 5, 504), left shift the result by
8 bits and add 8 to the temporary variable (Fig. 5, 505).

[0032] (Step 4) If flag Q2Z8 is equal to one (Fig. 5, 506), left shift the result by
8 bits and add 8 to the temporary variable (Fig. 5, 507).

[0033] (Step 5) If flag Q2Z4 is equal to one (Fig. 5, 508), left shift the result by
4 bits and add 4 to the temporary variable (Fig. 5, 509).

[0034] (Step 6) If flag Q2Z2 is equal to one (Fig. 5, 510), left shift the result by
2 bits and add 2 to the temporary variable (Fig. 5, 511).

[0035] (Step 7) If flag Q2Z1 is equal to one (Fig. 5, 512), left shift the result by
1 bit and add 1 to the temporary variable (Fig. 5, 513).

[0036] (Step 8) The exponent of the result is adjusted by subtracting the
temporary variable from the exponent, i.e., Q3 = Q3 - temporary variable (Fig. 5,
514).

[0037] Note that as the shifting is performed in the Q registers Q2 331, Q1 332,
Q0 333, the contents of the G register 320 are shifted into register Q0. Likewise,
the contents of the M0 313 register are shifted into register G 320.
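
The eight steps can be modeled in software as follows. This is a sketch under stated assumptions, not the circuit itself: the function `normalize` is hypothetical, the registers Q2, Q1, Q0, G, and M0 are modeled as one 40-bit integer so that G and M0 feed bits into Q0 as described above, zeros are assumed to enter at the right of M0, and exponent underflow is not handled.

```python
def normalize(q3, q2, q1, q0, g=0, m0=0):
    """Normalize the significand in Q2:Q1:Q0 using only 8-, 4-, 2- and 1-bit left shifts."""
    width = 40                                   # Q2, Q1, Q0, G, M0 as one 40-bit value
    mask = (1 << width) - 1
    sig = (q2 << 32) | (q1 << 24) | (q0 << 16) | (g << 8) | m0
    temp = 0                                     # step 1: clear the temporary variable
    # Steps 2-7: test Q2Z8 three times, then Q2Z4, Q2Z2 and Q2Z1.
    for n in (8, 8, 8, 4, 2, 1):
        if sig >> (width - n) == 0:              # flag is one: top n bits of Q2 are zero
            sig = (sig << n) & mask              # limited left shift by n bits
            temp += n                            # remember how far the significand moved
    q3 -= temp                                   # step 8: adjust the exponent
    return q3, (sig >> 32) & 0xFF, (sig >> 24) & 0xFF, (sig >> 16) & 0xFF, temp


# A significand with 14 leading zeros: 0000 0000 0000 0011 1000 0000
result = normalize(0x20, 0x00, 0x03, 0x80)
```

In this run the first 8-bit test, the 4-bit test, and the 2-bit test each trigger a shift, for a total of 14 bit positions, and the exponent drops from 32 to 18. Because the flags are re-evaluated on the shifted data at every step, the sequence of fixed shift sizes reaches any shift count from 0 to 31 without a barrel shifter.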



[0038] For example, suppose in one of the PEs 200 of the array 14, the Q Block
350b registers (Q3 330, Q2 331, Q1 332, and Q0 333) contain the following values:

Q3         Q2         Q1         Q0
0000 1000  0001 0101  1001 1001  0000 1111
[0039] Normalization is performed as follows: In step (1), a temporary variable is
set to zero. The temporary variable may be a register from the register file 302, a
memory location accessed via the DRAM Interface 304, or any other temporary storage
location. The content of the registers, flags, and temporary variable after step (1) are as
follows:

Q3         Q2         Q1         Q0
0000 1000  0001 0101  1001 1001  0000 1111

Q2Z8  Q2Z4  Q2Z2  Q2Z1  Temp
0     0     1     1     0



[0040] In step (2), since flag Q2Z8 is equal to zero, no further processing is
performed in step (2). The content of the registers, flags, and temporary variable after
step (2) are as follows:

Q3         Q2         Q1         Q0
0000 1000  0001 0101  1001 1001  0000 1111

Q2Z8  Q2Z4  Q2Z2  Q2Z1  Temp
0     0     1     1     0



[0041] In step (3), since flag Q2Z8 is equal to zero, no further processing is
performed in step (3). The content of the registers, flags, and temporary variable after
step (3) are as follows:

Q3         Q2         Q1         Q0
0000 1000  0001 0101  1001 1001  0000 1111

Q2Z8  Q2Z4  Q2Z2  Q2Z1  Temp
0     0     1     1     0



[0042] In step (4), since flag Q2Z8 is equal to zero, no further processing is
performed in step (4). The content of the registers, flags, and temporary variable after
step (4) are as follows:

Q3         Q2         Q1         Q0
0000 1000  0001 0101  1001 1001  0000 1111

Q2Z8  Q2Z4  Q2Z2  Q2Z1  Temp
0     0     1     1     0



[0043] In step (5), since flag Q2Z4 is equal to zero, no further processing is
performed in step (5). The content of the registers, flags, and temporary variable after
step (5) are as follows:

Q3         Q2         Q1         Q0
0000 1000  0001 0101  1001 1001  0000 1111

Q2Z8  Q2Z4  Q2Z2  Q2Z1  Temp
0     0     1     1     0



[0044] In step (6), since flag Q2Z2 is equal to one, the content of registers Q2,
Q1, and Q0 are left shifted by 2 bits, and 2 is added to the temporary variable. The
content of the registers, flags, and temporary variable after step (6) are as follows:

Q3         Q2         Q1         Q0
0000 1000  0101 0110  0110 0100  0011 1100

Q2Z8  Q2Z4  Q2Z2  Q2Z1  Temp
0     0     0     1     2



[0045] In step (7), since flag Q2Z1 is one, the content of registers Q2, Q1, and
Q0 are left shifted by 1 bit, and 1 is added to the temporary variable. The content of
the registers, flags, and temporary variable after step (7) are as follows:

Q3         Q2         Q1         Q0
0000 1000  1010 1100  1100 1000  0111 1000

Q2Z8  Q2Z4  Q2Z2  Q2Z1  Temp
0     0     0     0     3



[0046] In step (8), the content of the temporary variable (now 3) is subtracted
from the exponent (which is held in register Q3). The contents of the Q registers are
now normalized and the state of the registers, flags, and temporary variable (at this
point the temporary variable is no longer needed and may be used for other purposes)
are as follows:

Q3         Q2         Q1         Q0
0000 0101  1010 1100  1100 1000  0111 1000

Q2Z8  Q2Z4  Q2Z2  Q2Z1  Temp
0     0     0     0     3
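
The end state of this example can be sanity checked with integer arithmetic. The sketch below is an illustrative check only, with hypothetical variable names: it treats Q2:Q1:Q0 as a 24-bit integer and Q3 as a plain integer exponent, and confirms that the normalization moved the significand left by exactly the amount subtracted from the exponent, so the represented value is unchanged.

```python
# Register contents before normalization (Q2:Q1:Q0 and Q3).
before_sig, before_exp = 0b000101011001100100001111, 0b00001000
# Register contents after step (8).
after_sig, after_exp = 0b101011001100100001111000, 0b00000101

shift = before_exp - after_exp                      # amount removed from the exponent
value_preserved = before_sig * 2 ** before_exp == after_sig * 2 ** after_exp
msb_set = (after_sig >> 23) == 1                    # bit 7 of Q2 is now a one
```

The exponent decreased by 3, the significand equals the original shifted left by 3 bits, and the most significant bit of Q2 is set, which is the normalized condition stated earlier.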
[0047] Thus, the present invention provides an apparatus and a method for
normalizing the significand portion of a floating point number, such as those which
follow the IEEE-754 floating point standard, in a SIMD MPP environment. The
present invention is advantageous in that each PE 200 of the array 14 is not required to
have a full featured shifter, such as a barrel shifter. Instead, a faster but more limited
shifting logic, such as logic circuits 308a, 308b, which are only capable of shifting the
significand data by 1, 2, 4, or 8 bits, is used in combination with a shift control
register 360, under an eight step procedure to normalize the significand. Ideally, the
instruction or instructions which correspond to each of the eight steps can be executed
by a PE 200 in a single clock cycle. Since in a SIMD environment each PE 200 in the
array 14 executes the same instruction at the same time, every significand in the array 14
can be normalized in as little as eight clock cycles.

[0048] Although the invention has been discussed and illustrated in the context of
an 8-bit shift control register and shifting circuits which are capable of shifting significand
data by 1, 2, 4, and 8 bits, the invention is not so limited and may be generalized as
follows: The flexibility of the left shifting circuitry and the number of flags may be
varied. The number of flags and the flexibility of the left shifting circuitry are related as
follows. If there are F+1 flags (wherein F is an integer of at least 3), then the left
shifting circuitry should be capable of left shifting the significand being normalized by
$2^0$, $2^1$, $2^2$, ..., or $2^F$ bits.
[0049] The generalized normalization procedure begins with the arithmetic logic
unit setting to zero the value of a temporary storage location. Each flag is then
examined, beginning with flag F and ending with flag 0. For each flag which is equal to
one, the arithmetic logic unit causes the left shifting circuitry to left shift the significand
by the number of bits associated with that flag ($2^F$ bits for flag F) and adds the same
amount to the value stored in the temporary storage location. After every
flag has been analyzed, the value stored in the temporary storage location is subtracted
from the significand's exponent.
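
The generalized procedure can be sketched in software as follows. The function `normalize_general` and its parameters are hypothetical names introduced for illustration, and the sketch assumes that flag i is one when the $2^i$ most significant bits of the significand are zero and that `width` is at least $2^F$.

```python
def normalize_general(sig, exponent, width, F):
    """Normalize a `width`-bit significand using flags F..0 and power-of-two shifts."""
    mask = (1 << width) - 1
    temp = 0                                # temporary storage location starts at zero
    for i in range(F, -1, -1):              # examine flag F first, flag 0 last
        n = 2 ** i
        if sig >> (width - n) == 0:         # flag i: the top 2^i bits are all zero
            sig = (sig << n) & mask         # left shift by 2^i bits
            temp += n                       # accumulate the total shift
    return sig, exponent - temp, temp       # exponent is adjusted once at the end


# 16-bit significand 0000 0000 0001 0011 with F = 3 (flags for 8, 4, 2 and 1 bits).
general_result = normalize_general(0x0013, 20, 16, 3)
```

A single pass over F+1 flags can shift by at most $2^{F+1} - 1$ bits in total. The exemplary embodiment described earlier tests the widest flag three times so that a 24-bit significand can be fully normalized; a wider significand in this sketch would need the same kind of repetition.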

[0050] While certain embodiments of the invention have been described and 
illustrated above, the invention is not limited to these specific embodiments as 
numerous modifications, changes and substitutions of equivalent elements can be made 
without departing from the spirit and scope of the invention. Accordingly, the scope of 
the present invention is not to be considered as limited by the specifics of the particular
structures which have been described and illustrated, but is only limited by the scope of 
the appended claims. 
CLAIMS

[0051] What is claimed as new and desired to be protected by Letters Patent of the
United States is:



1. A circuit having support for normalization of significands, comprising:

a first register block, said first register block including at least one first
register for holding a first exponent and a first significand of a first floating point
number and a first logic capable of left shifting the significand of the first floating point
number;

a second register block, said second register block including at least one
second register for holding a second exponent and a second significand of a second
floating point number;

a plurality of flags coupled to said first register block, each of said plurality
of flags having a state based on the contents of said first significand;

an arithmetic logic unit coupled to said first register block, said second
register block, and said plurality of flags, said arithmetic logic unit causing the first logic
to left shift the first significand based upon the states of said plurality of flags.



2. The circuit of claim 1, wherein said plurality of flags further comprises:

an $I^{th}$ flag, wherein I is a non-negative integer, said $I^{th}$ flag which is set to
a first state when the $2^{I}$ most significant bits of said first significand are each zeros and a
second state if any of the $2^{I}$ most significant bits is non-zero.

3. The circuit of claim 2, wherein said arithmetic logic unit causes said first
logic to left shift by $2^{I}$ bits the first significand if said $I^{th}$ flag is set to the first state.

4. The circuit of claim 3, wherein said arithmetic logic unit is coupled to a
temporary storage location for storing an adjustment to be subtracted from said first
exponent, and increments said adjustment by $2^{I}$ if said first flag is set to the first state.

5. The circuit of claim 2, wherein I is 0.

6. The circuit of claim 2, wherein I is 1.

7. The circuit of claim 2, wherein I is 2.

8. The circuit of claim 2, wherein I is 3.

9. The circuit of claim 1, wherein said arithmetic logic unit is coupled to a 
temporary storage location. 
10. The circuit of claim 9, wherein said temporary storage location is a
register in a register file.

11. The circuit of claim 9, wherein said temporary storage location is a main
memory accessed through a memory interface.

12. The circuit of claim 1, wherein:

said plurality of flags further comprises,

an $I^{th}$ flag, wherein I is a positive integer of at least 3, which is set to a first state
when the $2^{I}$ most significant bits of said first significand are each zeros and a second
state if any of the $2^{I}$ most significant bits of said first significand is non-zero;

an $(I-1)^{th}$ flag which is set to a first state when the $2^{(I-1)}$ most significant bits of
said first significand are each zeros and a second state if any of the $2^{(I-1)}$ most significant
bits of said first significand is non-zero;

an $(I-2)^{th}$ flag which is set to a first state when the $2^{(I-2)}$ most significant bits of
said first significand are each zeros and a second state if any of the $2^{(I-2)}$ most significant
bits of said first significand is non-zero; and

an $(I-3)^{th}$ flag which is set to a first state when the $2^{(I-3)}$ significant bits of said first
significand are each zeros and a second state if the $2^{(I-3)}$ significant bits of said first
significand is non-zero; and wherein

said arithmetic logic unit is coupled to a temporary storage location, said
arithmetic logic unit initially setting the temporary storage location to zero, then
modifying said temporary location based upon the state of the plurality of flags, and
finally modifying said first exponent based on the contents of said temporary location.

13. The circuit of claim 12, wherein said temporary storage location is a
register in a register file.

14. The circuit of claim 12, wherein said temporary storage location is a main
memory accessed through a memory interface.

15. The circuit of claim 12, wherein said arithmetic logic unit modifies the first
exponent by subtracting the contents of said temporary location from said first
exponent.

16. The circuit of claim 12, wherein I is equal to 3.

17. A massively parallel processing system, comprising:

a main memory;

an array of processing elements, each processing element of the array being
coupled to said main memory and other processing elements of said array, wherein each
of said processing elements comprises,

a first register block, said first register block including at least one first register for
holding a first exponent and a first significand of a first floating point number and a first
logic capable of left shifting the significand of the first floating point number;

a second register block, said second register block including at least one second
register for holding a second exponent and a second significand of a second floating
point number;

a plurality of flags coupled to said first register block, each of said plurality of
flags having a state based on the contents of said first significand;

an arithmetic logic unit coupled to said first register block, said second register
block, and said plurality of flags, said arithmetic logic unit causing the first logic to left
shift the first significand based upon the states of said plurality of flags.

18. The massively parallel processing system of claim 17, wherein said
plurality of flags further comprises:

an $I^{th}$ flag, wherein I is a non-negative integer, said $I^{th}$ flag which is set to
a first state when the $2^{I}$ most significant bits of said first significand are each zeros and a
second state if any of the $2^{I}$ most significant bits is non-zero.

19. The massively parallel processing system of claim 18, wherein said
arithmetic logic unit causes said first logic to left shift by $2^{I}$ bits the first significand if
said $I^{th}$ flag is set to the first state.

20. The massively parallel processing system of claim 19, wherein said
arithmetic logic unit is coupled to a temporary storage location for storing an
adjustment to be subtracted from said first exponent, and increments said adjustment by
$2^{I}$ if said first flag is set to the first state.

21. The massively parallel processing system of claim 18, wherein I is 0. 

22. The massively parallel processing system of claim 18, wherein I is 1.

23. The massively parallel processing system of claim 18, wherein I is 2.

24. The massively parallel processing system of claim 18, wherein I is 3. 

25. The massively parallel processing system of claim 17, wherein said
arithmetic logic unit is coupled to a temporary storage location. 

26. The massively parallel processing system of claim 25, wherein said
temporary storage location is a register in a register file.

27. The massively parallel processing system of claim 25, wherein said
temporary storage location is a main memory accessed through a memory interface.

28. The massively parallel processing system of claim 17, wherein:

said plurality of flags further comprises,

an Ith flag, wherein I is a positive integer of at least 3, which is set to a first state
when the 2^I most significant bits of said first significand are each zeros and a second
state if any of the 2^I most significant bits of said first significand is non-zero;

a (I-1)th flag which is set to a first state when the 2^(I-1) most significant bits of said
first significand are each zeros and a second state if any of the 2^(I-1) most significant bits
of said first significand is non-zero;

a (I-2)th flag which is set to a first state when the 2^(I-2) most significant bits of said
first significand are each zeros and a second state if any of the 2^(I-2) most significant bits
of said first significand is non-zero; and

a (I-3)th flag which is set to a first state when the 2^(I-3) most significant bits of said
first significand are each zeros and a second state if the 2^(I-3) significant bits of said first
significand is non-zero; and wherein

said arithmetic logic unit is coupled to a temporary storage location, said
arithmetic logic unit initially setting the temporary storage location to zero, then
modifying said temporary location based upon the state of the plurality of flags, and
finally modifying said first exponent based on the contents of said temporary location.

29. The massively parallel processing system of claim 28, wherein said
temporary storage location is a register in a register file.

30. The massively parallel processing system of claim 28, wherein said
temporary storage location is a main memory accessed through a memory interface.



31. The massively parallel processing system of claim 17 wherein said
arithmetic logic unit modifies the first exponent by subtracting the contents of said
temporary location from said first exponent.

32. The massively parallel processing system of claim 18, wherein I is equal to 3.

33. A method for normalizing the significand of a floating point number
stored in a processing element having an exponent register, a plurality of significand
registers, an Ith flag indicating whether the 2^I most significant bits of the significand are
each zero, a (I-1)th flag indicating whether the 2^(I-1) most significant bits of the significand
are each zero, a (I-2)th flag indicating whether the 2^(I-2) most significant bits of the
significand are each zero, a (I-3)th flag indicating whether the 2^(I-3) most significant bit of
the significand is zero, and a temporary variable, wherein I is an integer of at least 3,

said method comprising the steps of:

(a) initializing the temporary variable to zero;

(b) if said Ith flag is set, left shifting the significand by 2^I bits and incrementing
the temporary variable by 2^I;

(c) if said (I-1)th flag is set, left shifting the significand by 2^(I-1) bits and
incrementing the temporary variable by 2^(I-1);

(d) if said (I-2)th flag is set, left shifting the significand by 2^(I-2) bits and
incrementing the temporary variable by 2^(I-2);

(e) if said (I-3)th flag is set, left shifting the significand by 2^(I-3) bit and
incrementing the temporary variable by 2^(I-3); and

(f) decrementing the exponent register by the value of the temporary
variable.

34. The method of claim 33, wherein I is equal to 3.

35. The method of claim 33, wherein step (a) is performed before step (b).

36. The method of claim 43, wherein step (c) is performed after step (b).

37. The method of claim 44, wherein step (d) is performed after step (c).

38. The method of claim 45, wherein step (e) is performed after step (d).

39. The method of claim 46, wherein step (f) is performed after step (c).
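
The normalization steps (a) through (f) of claim 33 can be illustrated in software. The following is only a minimal sketch under stated assumptions, not the claimed circuit: the function names `top_bits_zero` and `normalize_pass`, the 32-bit significand width, and the use of Python integers in place of hardware registers are all hypothetical choices made for illustration. Each flag is re-evaluated against the current contents of the significand, consistent with the flags having a state based on the contents of the significand.

```python
def top_bits_zero(sig: int, count: int, width: int) -> bool:
    """Flag state: True when the `count` most significant bits of `sig` are all zero."""
    return (sig >> (width - count)) == 0


def normalize_pass(exponent: int, sig: int, width: int = 32, top_flag: int = 3):
    """One pass over steps (a)-(f), with I = top_flag (assumed to be at least 3).

    `width` is an assumed significand width in bits; it must be at least 2**top_flag.
    Returns the adjusted (exponent, significand) pair.
    """
    mask = (1 << width) - 1
    adjustment = 0  # step (a): the temporary variable starts at zero

    # Steps (b)-(e): test the flags I, I-1, I-2, I-3 in that order.
    for k in range(top_flag, top_flag - 4, -1):
        shift = 1 << k  # the k-th flag covers the 2^k most significant bits
        if top_bits_zero(sig, shift, width):
            sig = (sig << shift) & mask  # limited left shift by 2^k bits
            adjustment += shift          # record the shift in the temporary variable

    # Step (f): decrement the exponent by the accumulated adjustment.
    return exponent - adjustment, sig
```

With I equal to 3, a single pass of this sketch removes at most 8 + 4 + 2 + 1 = 15 leading zeros, so a significand with more leading zeros than that (including an all-zero significand) is not fully normalized by one pass and would need separate handling.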
ABSTRACT

[0052] The processing elements of a single instruction multiple data (SIMD)
massively parallel processor (MPP) are provided with two register blocks. One register
block includes logic for performing limited left shifting, while the other register block
includes logic for performing limited right shifting. A method is disclosed for using the
register blocks with their associated logic to perform floating point significand
alignment and normalization. The limited shifting logic occupies less die space than a
full feature barrel shifter, thereby permitting a greater number of processing elements.